fix: PushDownFilter for GROUP BY on uppercase col names
#16049
@adriangb Saw your recent commits in this area, would appreciate it if you weighed in on this. Thank you! 🙌
Hmm I've been doing a lot with the physical optimizer of the same name but haven't touched the logical optimizer. The recent changes may mean that the pushdown ends up happening regardless at the physical level but I think it's worth fixing the logical level anyway. I don't fully understand the issue: does |
Sorry for not being more clear. I was referring to these lines: datafusion/datafusion/common/src/utils/mod.rs Lines 297 to 298 in 923bfb7
Reading it made me think that if I used quotes I might convince it to remain unchanged, but it still converts to lowercase |
Can we please get a test for this fix so we don't break it again in the future? |
Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look |
@alamb Thanks for waiting, added a test that would break without this change |
xudong963
left a comment
I also suggest adding a sqllogictest based on the SQL in the PR summary
#[test]
fn filter_agg_case_insensitive() -> Result<()> {
let table_scan = test_table_scan_with_uppercase_columns()?;
If the table also has a column named 'a', what'll happen?
Great question, just tried this and it works as expected for both uppercase and lower case col, even if both are present in the schema at the same time. I added another test, lmk if we should keep it or it's overkill.
Also happy to see a related sqllogictest
I'd be happy to, can you please point me at a sample patch or a good suite to add to? Last time I tried there was quite a bit of ceremony around SLT, not sure if I can get it right on first approach. Thanks! |
Here are the instructions: https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest Ideally you should be able to extend one of the existing test files in https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest/test_files |
@xudong963 @alamb apologies for the delayed update, I was out on vacation for most of July. I added 2 slt tests and squashed/rebased - should be ready for merge now |
05)--------AggregateExec: mode=Partial, gby=[A@0 as A], aggr=[sum(test_uppercase_cols.B)]
06)----------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
07)------------CoalesceBatchesExec: target_batch_size=8192
08)--------------FilterExec: A@0 > 10
This is the key fix, seeing the FilterExec pushed right on top of the datasource exec.
Without the fix, the plan would look like this (with the filter executed AFTER the aggregation):
+ 02)--CoalesceBatchesExec: target_batch_size=8192
+ 03)----FilterExec: A@0 > 10
+ 04)------AggregateExec: mode=SinglePartitioned, gby=[A@0 as A], aggr=[sum(test_uppercase_cols.B)]
+ 05)--------DataSourceExec: partitions=1, partition_sizes=[0]
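For readers less familiar with why this pushdown is safe: a predicate that references only the GROUP BY key selects the same groups whether it runs before or after the aggregation, so moving it below the aggregate only reduces the rows that get aggregated. The following is a minimal standalone sketch of that equivalence, not DataFusion code; the function names and the `(a, b)` tuple rows are hypothetical stand-ins for a query shaped like the plans above (group by A, sum B, keep A > 10).

```rust
use std::collections::BTreeMap;

/// Sum `b` per group key `a` (toy model of GROUP BY a with sum(b)).
fn sum_by_key(rows: &[(i64, i64)]) -> BTreeMap<i64, i64> {
    let mut out = BTreeMap::new();
    for &(a, b) in rows {
        *out.entry(a).or_insert(0) += b;
    }
    out
}

/// Plan without pushdown: aggregate every row, then filter on the key.
fn filter_after_aggregate(rows: &[(i64, i64)], min_a: i64) -> BTreeMap<i64, i64> {
    sum_by_key(rows)
        .into_iter()
        .filter(|&(a, _)| a > min_a)
        .collect()
}

/// Plan with pushdown: filter rows on the key first, then aggregate the survivors.
fn filter_before_aggregate(rows: &[(i64, i64)], min_a: i64) -> BTreeMap<i64, i64> {
    let kept: Vec<(i64, i64)> = rows
        .iter()
        .copied()
        .filter(|&(a, _)| a > min_a)
        .collect();
    sum_by_key(&kept)
}
```

Both functions return the same map for any input; the pushed-down version simply aggregates fewer rows, which is the benefit the fixed plan shows.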
alamb
left a comment
Thanks @aditanase
.iter()
.map(|e| Ok(Column::from_qualified_name(e.schema_name().to_string())))
.map(|e| {
Ok(Column::from_qualified_name_ignore_case(
I need to study this more carefully, but this change to ignore case seems inconsistent with the rest of this file and also potentially incorrect
For example:
- For dialects where case is not important, universally ignoring the case could push down the wrong fields
- For dialects where case is important, there seem to be many other places in this module that do not compare the case 🤔
So it seems:
- We should be checking the case sensitivity config setting before comparing
- Are there other places that should be changed to take name into account 🤔
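To make the first concern concrete, here is a toy model (not DataFusion code) of what could go wrong if names were matched without regard to case while the schema holds columns that differ only by case. The `matching_fields` helper and its `case_sensitive` parameter are hypothetical and exist only for this illustration.

```rust
/// Return the indices of schema fields whose name matches `name`.
/// `case_sensitive` is a hypothetical switch for this sketch only.
fn matching_fields(schema: &[&str], name: &str, case_sensitive: bool) -> Vec<usize> {
    schema
        .iter()
        .enumerate()
        .filter(|(_, field)| {
            if case_sensitive {
                **field == name
            } else {
                field.eq_ignore_ascii_case(name)
            }
        })
        .map(|(index, _)| index)
        .collect()
}
```

With a schema of `["a", "A", "b"]`, an exact match on `"A"` yields a single field, while a case-insensitive match yields two candidates, so a rule relying on the latter could not tell which column a predicate refers to.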
@alamb thanks for taking a look. I agree the fix looks fishy, and would love to better understand the mechanics of case comparison. We can keep the tests and come up with a better fix that addresses your concerns.
It boils down to the ignore_case flag, which is not very intuitive: setting it to true keeps the UPPERCASE name, while setting it to false lowercases the column name from the gby expression:
https://github.com/apache/datafusion/blob/main/datafusion/common/src/utils/mod.rs#L298-L299
The problem is introduced as we're roundtripping the correctly parsed column name
Column { relation: Some(Bare { table: "test" }), name: "A" }
through
Column::from_qualified_name(e.schema_name().to_string())
Which will lowercase it, as now there are no more double quotes around the field name. That's why I switched to Column::from_qualified_name_ignore_case, to avoid lowercasing as we're already starting from parsed/cased names.
Very open to suggestions
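A rough standalone model of the round trip described above, for anyone following along. This is a simplified sketch and not the DataFusion implementation: `parse_identifier` and `column_name_from_qualified` are hypothetical names, the real parsing goes through the SQL tokenizer, and the quoted-identifier branch is an assumption about how quoting is meant to behave.

```rust
/// Simplified stand-in for identifier normalization.
/// Assumption: a double-quoted identifier keeps its case; an unquoted one is
/// lowercased unless `ignore_case` is true (the flag's name is kept from the thread).
fn parse_identifier(raw: &str, ignore_case: bool) -> String {
    if raw.len() >= 2 && raw.starts_with('"') && raw.ends_with('"') {
        // Strip the surrounding quotes and keep the inner text untouched.
        raw[1..raw.len() - 1].to_string()
    } else if ignore_case {
        raw.to_string()
    } else {
        raw.to_ascii_lowercase()
    }
}

/// Take the part after the last dot of `table.column` and normalize it.
fn column_name_from_qualified(qualified: &str, ignore_case: bool) -> String {
    let column = qualified.rsplit('.').next().unwrap_or(qualified);
    parse_identifier(column, ignore_case)
}
```

In this model, rendering the already-parsed column as the plain string `test.A` and re-parsing it with `ignore_case = false` yields `a`, which no longer matches the schema field `A`; with `ignore_case = true` the name survives as `A`.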
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
Which issue does this PR close?
PushDownFilter does not push a predicate when the table has columns that are not all lowercase. Tried with and without enable_ident_normalization - no change. The logic inside parse_identifiers_normalized does not seem to properly detect quotes and it will lowercase the column used in the group by expression.
Here's the query I used, just for illustration:
Expected query plan:
Actual query plan:
An alternate fix could use expr_to_columns to extract the columns, as in Unnest above:
Question: should we make the same change in the Window functions branch?
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Yes, from a client application.
I did not add any unit tests, none of the existing tests in this module use upper case columns. Tried to add another table/schema, but then the test was failing, I am unsure of how to control the lowercasing of column names.
Are there any user-facing changes?